Introduction

Of all the applications of machine-learning, diagnosing any serious disease using a black box is always going to be a hard sell. If the output from a model is the particular course of treatment (potentially with side-effects), or surgery, or the absence of treatment, people are going to want to know why.

This dataset gives a number of variables along with a target condition of having or not having heart disease. Below, the data is first tried to be predicted using a losgistic regression model, then a random forest model. However, decision tree is selcted as the mode appropriate model.

Data Description age: The person’s age in years

sex: The person’s sex (1 = male, 0 = female)

cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)

trestbps: The person’s resting blood pressure (mm Hg on admission to the hospital)

chol: The person’s cholesterol measurement in mg/dl

fbs: The person’s fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)

restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)

thalach: The person’s maximum heart rate achieved

exang: Exercise induced angina (1 = yes; 0 = no)

oldpeak: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot. See more here)

slope: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)

ca: The number of major vessels (0-3)

thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversable defect)

target: Heart disease (0 = no, 1 = yes)

Presentation

library(tidyverse)
library(readr)
library(ROCR)
library(PerformanceAnalytics)
library(e1071)
library(caret)
library(gbm)
library(corrplot)
library(ggcorrplot)
library(MASS)
library(rpart)
library(caTools)
library(naivebayes)
library(class)
library(ISLR)
library(glmnet)
library(Hmisc)
library(funModeling)
library(pROC)
library(randomForest)
library(klaR)
library(scales)
library(cluster)
library(factoextra)
library(DataExplorer)
library(ClustOfVar)
library(GGally)
library(gmodels)
library(C50)
# reading the data

heart <- read.csv("heart.csv")
str(heart)
## 'data.frame':    303 obs. of  14 variables:
##  $ ï..age  : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...
summary(heart)
##      ï..age           sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
##  1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:120.0  
##  Median :55.00   Median :1.0000   Median :1.000   Median :130.0  
##  Mean   :54.37   Mean   :0.6832   Mean   :0.967   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :240.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.3   Mean   :0.1485   Mean   :0.5281   Mean   :149.6  
##  3rd Qu.:274.5   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak         slope             ca        
##  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.80   Median :1.000   Median :0.0000  
##  Mean   :0.3267   Mean   :1.04   Mean   :1.399   Mean   :0.7294  
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.20   Max.   :2.000   Max.   :4.0000  
##       thal           target      
##  Min.   :0.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000  
##  Median :2.000   Median :1.0000  
##  Mean   :2.314   Mean   :0.5446  
##  3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000

Methodology

Results and Discussion

## Here we are cleaning the data set and modifying the factors

data2 <- heart %>% 
  mutate(sex = if_else(sex == 1, "MALE", "FEMALE"),
         fbs = if_else(fbs == 1, ">120", "<=120"),
         exang = if_else(exang == 1, "YES" ,"NO"),
         cp = if_else(cp == 1, "ATYPICAL ANGINA",
                      if_else(cp == 2, "NON-ANGINAL PAIN", "ASYMPTOMATIC")),
         restecg = if_else(restecg == 0, "NORMAL",
                           if_else(restecg == 1, "ABNORMALITY", "PROBABLE OR DEFINITE")),
         slope = as.factor(slope),
         ca = as.factor(ca),
         thal = as.factor(thal),
         target = if_else(target == 1, "YES", "NO")
         ) %>% 
  mutate_if(is.character, as.factor) %>% 
  dplyr::select(target, sex, fbs, exang, cp, restecg, slope, ca, thal, everything())

summary(data2)
##  target        sex         fbs      exang                    cp     
##  NO :138   FEMALE: 96   <=120:258   NO :204   ASYMPTOMATIC    :166  
##  YES:165   MALE  :207   >120 : 45   YES: 99   ATYPICAL ANGINA : 50  
##                                               NON-ANGINAL PAIN: 87  
##                                                                     
##                                                                     
##                                                                     
##                  restecg    slope   ca      thal        ï..age     
##  ABNORMALITY         :152   0: 21   0:175   0:  2   Min.   :29.00  
##  NORMAL              :147   1:140   1: 65   1: 18   1st Qu.:47.50  
##  PROBABLE OR DEFINITE:  4   2:142   2: 38   2:166   Median :55.00  
##                                     3: 20   3:117   Mean   :54.37  
##                                     4:  5           3rd Qu.:61.00  
##                                                     Max.   :77.00  
##     trestbps          chol          thalach         oldpeak    
##  Min.   : 94.0   Min.   :126.0   Min.   : 71.0   Min.   :0.00  
##  1st Qu.:120.0   1st Qu.:211.0   1st Qu.:133.5   1st Qu.:0.00  
##  Median :130.0   Median :240.0   Median :153.0   Median :0.80  
##  Mean   :131.6   Mean   :246.3   Mean   :149.6   Mean   :1.04  
##  3rd Qu.:140.0   3rd Qu.:274.5   3rd Qu.:166.0   3rd Qu.:1.60  
##  Max.   :200.0   Max.   :564.0   Max.   :202.0   Max.   :6.20
logit.mode1 <- glm( target ~ . , family = "binomial", data = data2)

summary(logit.mode1)
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = data2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8019  -0.3504   0.1406   0.4323   2.9873  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -0.799378   3.176635  -0.252  0.80132    
## sexMALE                     -1.441798   0.521275  -2.766  0.00568 ** 
## fbs>120                      0.627653   0.555851   1.129  0.25882    
## exangYES                    -1.017763   0.429158  -2.372  0.01771 *  
## cpATYPICAL ANGINA            0.387536   0.558853   0.693  0.48803    
## cpNON-ANGINAL PAIN           1.350664   0.472046   2.861  0.00422 ** 
## restecgNORMAL               -0.370497   0.383120  -0.967  0.33352    
## restecgPROBABLE OR DEFINITE -1.322756   2.451173  -0.540  0.58944    
## slope1                      -0.551347   0.796351  -0.692  0.48872    
## slope2                       0.668520   0.885995   0.755  0.45052    
## ca1                         -2.215448   0.498913  -4.441 8.97e-06 ***
## ca2                         -3.164688   0.738587  -4.285 1.83e-05 ***
## ca3                         -2.542796   0.903922  -2.813  0.00491 ** 
## ca4                          0.890341   1.608751   0.553  0.57996    
## thal1                        2.249101   2.103214   1.069  0.28491    
## thal2                        2.426802   2.010214   1.207  0.22734    
## thal3                        0.924743   2.017278   0.458  0.64666    
## ï..age                       0.031697   0.024565   1.290  0.19694    
## trestbps                    -0.018380   0.011327  -1.623  0.10465    
## chol                        -0.005126   0.004034  -1.271  0.20382    
## thalach                      0.022643   0.010979   2.062  0.03918 *  
## oldpeak                     -0.272669   0.233259  -1.169  0.24242    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 417.64  on 302  degrees of freedom
## Residual deviance: 192.37  on 281  degrees of freedom
## AIC: 236.37
## 
## Number of Fisher Scoring iterations: 6
# when men have the heart attack, men come to ED with pain 
# while women have symptoms that are not nominally associated with heart attack
# moreover gender bias by physician plays a role
# blood sugar >120 is more prone
# old peak is not a sig variable

logit.mode2 <- glm( target ~ sex*cp + .  , family = "binomial", data = data2)

summary(logit.mode2)
## 
## Call:
## glm(formula = target ~ sex * cp + ., family = "binomial", data = data2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.68994  -0.38316   0.07254   0.40307   2.99164  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -2.402996   3.582727  -0.671  0.50240    
## sexMALE                     -0.864383   0.633699  -1.364  0.17256    
## cpATYPICAL ANGINA            0.909081   0.972810   0.934  0.35005    
## cpNON-ANGINAL PAIN           3.439028   1.382793   2.487  0.01288 *  
## fbs>120                      0.685497   0.564049   1.215  0.22425    
## exangYES                    -0.890768   0.434208  -2.051  0.04022 *  
## restecgNORMAL               -0.411750   0.389731  -1.057  0.29074    
## restecgPROBABLE OR DEFINITE -2.069213   3.794360  -0.545  0.58552    
## slope1                      -0.514944   0.792875  -0.649  0.51604    
## slope2                       0.612706   0.891511   0.687  0.49191    
## ca1                         -2.339835   0.518361  -4.514 6.36e-06 ***
## ca2                         -3.079886   0.735950  -4.185 2.85e-05 ***
## ca3                         -2.378054   0.911009  -2.610  0.00904 ** 
## ca4                          0.791737   1.568229   0.505  0.61366    
## thal1                        3.104065   2.579220   1.203  0.22879    
## thal2                        3.437036   2.518750   1.365  0.17239    
## thal3                        1.825456   2.504979   0.729  0.46617    
## ï..age                       0.032036   0.024723   1.296  0.19505    
## trestbps                    -0.017222   0.011413  -1.509  0.13131    
## chol                        -0.006397   0.004385  -1.459  0.14457    
## thalach                      0.024882   0.011096   2.242  0.02493 *  
## oldpeak                     -0.260573   0.234493  -1.111  0.26647    
## sexMALE:cpATYPICAL ANGINA   -0.647946   1.170074  -0.554  0.57974    
## sexMALE:cpNON-ANGINAL PAIN  -2.503991   1.471032  -1.702  0.08872 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 417.64  on 302  degrees of freedom
## Residual deviance: 188.73  on 279  degrees of freedom
## AIC: 236.73
## 
## Number of Fisher Scoring iterations: 6
# the lower the AIC the better your model, the right direction in lower AIC

#split into test and train

set.seed(12345)
data2_rand <- data2[order(runif(303)), ]

# the 303 number is the number of rows

summary(data2$trestbps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    94.0   120.0   130.0   131.6   140.0   200.0
summary(data2_rand$trestbps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    94.0   120.0   130.0   131.6   140.0   200.0
str(data2)
## 'data.frame':    303 obs. of  14 variables:
##  $ target  : Factor w/ 2 levels "NO","YES": 2 2 2 2 2 2 2 2 2 2 ...
##  $ sex     : Factor w/ 2 levels "FEMALE","MALE": 2 2 1 2 1 2 1 2 2 2 ...
##  $ fbs     : Factor w/ 2 levels "<=120",">120": 2 1 1 1 1 1 1 1 2 1 ...
##  $ exang   : Factor w/ 2 levels "NO","YES": 1 1 1 1 2 1 1 1 1 1 ...
##  $ cp      : Factor w/ 3 levels "ASYMPTOMATIC",..: 1 3 2 2 1 1 2 2 3 3 ...
##  $ restecg : Factor w/ 3 levels "ABNORMALITY",..: 2 1 2 1 1 1 2 1 1 1 ...
##  $ slope   : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
##  $ ca      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ thal    : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
##  $ ï..age  : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
set.seed(300)
data2_train <- data2_rand[1:240, ]
data2_test  <- data2_rand[241:303, ]

rf <- randomForest(target ~ ., data = data2_train)



rf
## 
## Call:
##  randomForest(formula = target ~ ., data = data2_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 19.58%
## Confusion matrix:
##     NO YES class.error
## NO  84  24   0.2222222
## YES 23 109   0.1742424
data2_pred_rf <- predict(rf, data2_test)
CrossTable(data2_test$target, data2_pred_rf,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual heart disease', 'predicted heart disease'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  63 
## 
##  
##                      | predicted heart disease 
## actual heart disease |        NO |       YES | Row Total | 
## ---------------------|-----------|-----------|-----------|
##                   NO |        24 |         6 |        30 | 
##                      |     0.381 |     0.095 |           | 
## ---------------------|-----------|-----------|-----------|
##                  YES |         4 |        29 |        33 | 
##                      |     0.063 |     0.460 |           | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |        28 |        35 |        63 | 
## ---------------------|-----------|-----------|-----------|
## 
## 
accuracy1=(32+22)/(63)
accuracy1
## [1] 0.8571429
# this model is 85% accurate
# this model accuatately predicted the presence of heart disease 85 percent of the time
# we have 3/63 false negatives


# try decision tree to det imp factor




# try without thal, thalach, ca, oldpeak, slope
# everyone gets ekg, fasting blood sugar, cholestorol

data2_dt <- C5.0(data2_train[-1], data2_train$target)

#plot will not help but try
plot(data2_dt)

# we can see the most critical factor is checking balance

# display simple facts about the tree
data2_dt
## 
## Call:
## C5.0.default(x = data2_train[-1], y = data2_train$target)
## 
## Classification Tree
## Number of samples: 240 
## Number of predictors: 13 
## 
## Tree size: 17 
## 
## Non-standard options: attempt to group attributes
# display detailed information about the tree
summary(data2_dt)
## 
## Call:
## C5.0.default(x = data2_train[-1], y = data2_train$target)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sun Apr 19 23:25:52 2020
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 240 cases (14 attributes) from undefined.data
## 
## Decision tree:
## 
## thal = 2:
## :...ca in {0,4}:
## :   :...oldpeak <= 1.6: YES (83/4)
## :   :   oldpeak > 1.6:
## :   :   :...ï..age <= 61: NO (3)
## :   :       ï..age > 61: YES (2)
## :   ca in {1,2,3}:
## :   :...cp = NON-ANGINAL PAIN: YES (11/1)
## :       cp = ASYMPTOMATIC:
## :       :...sex = MALE: NO (16/3)
## :       :   sex = FEMALE:
## :       :   :...trestbps <= 134: YES (3)
## :       :       trestbps > 134: NO (2)
## :       cp = ATYPICAL ANGINA:
## :       :...restecg in {ABNORMALITY,PROBABLE OR DEFINITE}: YES (3)
## :           restecg = NORMAL:
## :           :...exang = NO: NO (3)
## :               exang = YES: YES (2)
## thal in {0,1,3}:
## :...oldpeak > 0.7: NO (74/9)
##     oldpeak <= 0.7:
##     :...ca = 4: YES (0)
##         ca in {1,2,3}: NO (14/4)
##         ca = 0:
##         :...cp in {ATYPICAL ANGINA,NON-ANGINAL PAIN}: YES (11/1)
##             cp = ASYMPTOMATIC:
##             :...chol > 237: NO (4)
##                 chol <= 237:
##                 :...ï..age <= 42: NO (2)
##                     ï..age > 42: YES (7)
## 
## 
## Evaluation on training data (240 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      16   22( 9.2%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     102     6    (a): class NO
##      16   116    (b): class YES
## 
## 
##  Attribute usage:
## 
##  100.00% thal
##   83.33% oldpeak
##   69.17% ca
##   26.67% cp
##    8.75% sex
##    5.83% ï..age
##    5.42% chol
##    3.33% restecg
##    2.08% exang
##    2.08% trestbps
## 
## 
## Time: 0.0 secs
# thal is the principal attribute
# 0 is 








data2_train$thal = NULL
data2_test$thal = NULL



data2_dt <- C5.0(data2_train[-1], data2_train$target)

#plot will not help but try
plot(data2_dt)

# thal was removed because it is not the first thing donte to apatient
# sthal is a stress tet that is further down the road
# if you had a mole and doc suggests chemotherapy





data2_train$thalach = NULL
data2_test$thalach = NULL

data2_dt <- C5.0(data2_train[-1], data2_train$target)

#plot will not help but try
plot(data2_dt)

data2_train$ca = NULL
data2_test$ca = NULL

data2_dt <- C5.0(data2_train[-1], data2_train$target)

#plot will not help but try
plot(data2_dt)

data2_train$oldpeak = NULL
data2_test$oldpeak = NULL

data2_dt <- C5.0(data2_train[-1], data2_train$target)

#plot will not help but try
plot(data2_dt)

data2_train$slope = NULL
data2_test$slope = NULL

data2_dt <- C5.0(data2_train[-1], data2_train$target)

#plot will not help but try
plot(data2_dt)

#prestest prob vs post test prob

data2_pred <- predict(data2_dt, data2_test)

CrossTable(data2_test$target, data2_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual target', 'predicted target'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  63 
## 
##  
##               | predicted target 
## actual target |        NO |       YES | Row Total | 
## --------------|-----------|-----------|-----------|
##            NO |        26 |         4 |        30 | 
##               |     0.413 |     0.063 |           | 
## --------------|-----------|-----------|-----------|
##           YES |        10 |        23 |        33 | 
##               |     0.159 |     0.365 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |        36 |        27 |        63 | 
## --------------|-----------|-----------|-----------|
## 
## 
confusionMatrix(data2_pred, as.factor(data2_test$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction NO YES
##        NO  26  10
##        YES  4  23
##                                           
##                Accuracy : 0.7778          
##                  95% CI : (0.6554, 0.8728)
##     No Information Rate : 0.5238          
##     P-Value [Acc > NIR] : 2.846e-05       
##                                           
##                   Kappa : 0.5586          
##                                           
##  Mcnemar's Test P-Value : 0.1814          
##                                           
##             Sensitivity : 0.8667          
##             Specificity : 0.6970          
##          Pos Pred Value : 0.7222          
##          Neg Pred Value : 0.8519          
##              Prevalence : 0.4762          
##          Detection Rate : 0.4127          
##    Detection Prevalence : 0.5714          
##       Balanced Accuracy : 0.7818          
##                                           
##        'Positive' Class : NO              
## 
accuracy = (26+23)/63

accuracy
## [1] 0.7777778
################################################# create another decision tree by sex split


#Splitiing the view of the tree
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
myTree2 <- C50:::as.party.C5.0(data2_dt)
branch1 = plot(myTree2[2])

branch2 = plot(myTree2[20])

Thal description Nuclear stress testing requires the injection of a tracer, commonly technicium 99M (Myoview or Cardiolyte), which is then taken up by healthy, viable myocardial cells. A camera (detector) is used afterwards to image the heart and compare segments. A coronary stenosis is detected when a myocardial segment takes up the nuclear tracer at rest, but not during cardiac stress. This is called a “reversible defect.” Scarred myocardium from prior infarct will not take up tracer at all and is referred to as a “fixed defect.”